Abstract

Zero-shot learning has received increasing interest as a means to alleviate the often prohibitive expense of annotating training data for large scale recognition problems. These methods have achieved great success via learning intermediate semantic representations in the form of attributes and, more recently, semantic word vectors. However, they have thus far been constrained to the single-label case, in contrast to the growing popularity and importance of more realistic multi-label data. In this paper, for the first time, we investigate and formalise a general framework for multi-label zero-shot learning, addressing the unique challenge therein: how to exploit multi-label correlation at test time with no training data for those classes? In particular, we propose (1) a multi-output deep regression model to project an image into a semantic word space, which explicitly exploits the correlations in the intermediate semantic layer of word vectors; (2) a novel zero-shot learning algorithm for multi-label data that exploits the unique compositionality property of semantic word vector representations; and (3) a transductive learning strategy to enable the regression model learned from seen classes to generalise well to unseen classes. Our zero-shot learning experiments on a number of standard multi-label datasets demonstrate that our method outperforms a variety of baselines.


1 Introduction 

There are around 30,000 human-distinguishable basic object classes [□] and many more subordinate ones. A major barrier to progress in visual recognition is thus collecting training data for many classes. Zero-shot learning (ZSL) strategies have therefore gained increasing interest as a route to side-step this prohibitive cost, as well as enabling potential new categories emerging over time to be represented and recognised. To classify instances from a class with no examples, ZSL exploits knowledge transferred from a set of seen (auxiliary) classes to unseen (test) classes, typically via an intermediate semantic representation such as attributes. This has recently been explored at large scale on ImageNet [□, EE].

Prior zero-shot learning methods have assumed that class labels on each instance are mutually exclusive, i.e., multi-class single label classification. Nevertheless many real-world data are intrinsically multi-label. For example, an image on Flickr often contains multiple objects with cluttered background, thus requiring more than one label to describe its content. There is an even more acute need for zero-shot learning in the case of multi-label classification. This is because different labels are often correlated (e.g. cows often appear on grass). In order to better predict these labels given an image, the label correlation must be modelled. However, for $n$ labels, there are $2^n$ possible multi-label combinations, and collecting sufficient training samples for each combination to learn the correlations of labels is infeasible. It is thus surprising to note that there is little if any existing work on multi-label zero-shot learning. Is it because there is a trivial extension of existing single label ZSL approaches to this new problem? By assuming each label is independent from one another, it is indeed possible to decompose a multi-label ZSL problem into multiple single label ZSL problems and solve them using existing single label ZSL methods. However this does not exploit label correlation, and we demonstrate in this work that this naive extension leads to very poor label prediction for unseen classes. Any attempt to model this correlation, in particular for the unseen classes in the zero-shot setting, is extremely challenging.

In this paper, a novel framework for multi-label zero-shot learning is proposed. Our framework is based on transfer learning - given a training/auxiliary dataset containing labelled images, and a test/target dataset with a set of unseen labels/classes (i.e. none of the labels appear in the training set), we aim to learn a multi-label classification model from the training set and generalise/transfer it to the test set with unseen labels. This knowledge transfer is achieved using an intermediate semantic representation in the form of the skip-gram word vectors [IZ3, Q] learned from linguistic knowledge bases. This representation is shared between the training and test classes, thus making the transfer possible.

More specifically, our framework has two main components: multi-output deep regression (Mul-DR) and zero-shot multi-label prediction (ZS-MLP). Mul-DR is a 9-layer neural network that exploits the widely used convolutional neural network (CNN) layers [123], and includes two multi-output regression layers as the final layers. It learns from auxiliary data the explicit and direct mapping from raw image pixels to a linguistic representation defined by the skip-gram language model [123, E3]. With Mul-DR, each test image is now projected into the semantic word space where the unseen labels and their combinations can be represented as data points without the need to collect any visual data. ZS-MLP aims to address the multi-label ZSL problem in this semantic word space. Specifically, we note that in this space any label combination can be synthesised. We thus exhaustively synthesise the power set of all possible prototypes (i.e., combinations of multi-labels) to be treated as if they were a set of labelled instances in the space. With this synthetic dataset, we are able to extend conventional multi-label algorithms [O, O, E3, El], to propose two new multi-label algorithms - direct multi-label zero-shot prediction (DMP) and transductive multi-label zero-shot prediction (TraMP). However, since Mul-DR is learned using the auxiliary classes/labels, it may not generalise well to the unseen classes/labels. To overcome this problem, we further exploit self-training to adapt the Mul-DR to the test classes to improve its generalisation capability.


2 Related Work 

Multi-label classification Multi-label classification has been widely studied - for a review of the field please see [E3, E3]. Most previous studies assume plenty of training data. Recently efforts have been made to relax this assumption. Kong et al. [D] studied transductive multi-label learning with a small set of training instances. Hariharan et al. [O] explored the label correlations of auxiliary data via a multi-label max-margin formulation and better incorporated such label correlations as a prior for the multi-class zero-shot learning problem. However, none of them addresses the multi-label zero-shot learning problem tackled in this work.

FU ET AL: TRANSDUCTIVE MULTI-LABEL ZERO-SHOT LEARNING

Zero-shot learning Multi-class single label zero-shot learning has now been widely studied using attribute-based intermediate semantic layers [□, B, nU, O, 113, E3] or data-driven [S, Q, O, EB] representations. However attribute-based strategies have limited ability to scale to many classes because the attribute ontology has to be manually defined. To address this limitation, Socher et al. [EID] first employed a linguistic model [O] as the intermediate semantic representation. However, this does not model the syntactic and semantic regularities in language [E3] which allow vector-oriented reasoning. Such reasoning is critical for our ZS-MLP to synthesise label combination prototypes in the semantic word space. For example, Vec(“Moscow”) should be much closer to Vec(“Russia”) + Vec(“capital”) than to Vec(“Russia”) or Vec(“capital”) alone. For this purpose, we employ the skip-gram language model to learn the word space, which has been shown to capture such syntactic regularities [E3, IZ3]. Frome et al. [□] also used the skip-gram language model. They learned a visual-semantic embedding model, DeViSE, for single label zero-shot learning by projecting both visual and semantic information of auxiliary data into a common space. However there are a number of fundamental differences between their work and ours: (1) Comparing the DeViSE model with our Mul-DR, the learning of the mapping between images and the semantic word space by Mul-DR is more explicit and direct. We show in our experiments that this leads to better projections and thus better classification performance. (2) Our Mul-DR can generalise better to the unseen test classes thanks to our self-training based transductive learning strategy. (3) Most critically, we address the multi-label ZSL problem whilst they only focused on the single label ZSL problem. Additionally, zero-shot learning can be taken as the generalisation of class-incremental learning (C-IL) [i, E3] or life-long learning [EB].

Our Contributions Overall, we make the following contributions: (1) As far as we know this is the first work that addresses the multi-label zero-shot learning problem. (2) Our multi-output deep regression framework exploits correlations across dimensions while learning the direct mapping from images to the intermediate skip-gram linguistic word space. (3) Within the linguistic space, two algorithms are proposed for multi-label ZSL. (4) We propose a simple self-training strategy to make the deep regression model generalise better to the unseen test classes. (5) Experimental results on benchmark multi-label datasets show the efficacy of our framework for multi-label ZSL over a variety of baselines.

3 Methodology 

3.1 Problem setup 

Suppose we have two datasets - source/auxiliary and target/test. The auxiliary dataset $S = \{X_S, Y_S, L_S, W_S\}$ has $n_S$ training instances and the test dataset $T = \{X_T, Y_T, L_T, W_T\}$ has $n_T$ test instances. We use $\mathcal{S} = \{1, \cdots, n_S\}$ and $\mathcal{U} = \{n_S + 1, \cdots, n_T + n_S\}$ to denote the index sets for instances in the auxiliary and test datasets. $X_S = \{x_1, \cdots, x_{n_S}\}$ and $X_T = \{x_{n_S+1}, \cdots, x_{n_S+n_T}\}$ are the raw image data of all auxiliary and test instances respectively. $Y_S = [y_1, \cdots, y_{n_S}]$ and $Y_T = [y_{n_S+1}, \cdots, y_{n_S+n_T}]$ are the intermediate semantic representations of each auxiliary and test instance - in our case $y_i$ is a 100 dimensional continuous word vector for instance $i$ in the skip-gram language model [E3] space. $L_S = [l_1, \cdots, l_{n_S}]$ and $L_T = [l_{n_S+1}, \cdots, l_{n_S+n_T}]$ are the label vectors for the auxiliary dataset and, to be predicted, the test dataset respectively.

The possible textual labels for each instance in $L_S$ and $L_T$ are denoted $W_S = \{w_1, \cdots, w_{m_S}\}$ and $W_T = \{w_{m_S+1}, \cdots, w_{m_S+m_T}\}$ respectively, where $m_S$ and $m_T$ are the total number of classes/labels in each dataset. Given a label-space of $m_T$ binary labels, an instance $x_i$ can be tagged with any of the $2^{m_T}$ possible label subsets, $l_i \in \{0,1\}^{m_T}$, where $l_{ij} = 1$ means instance $i$ has label $j$, and $l_{ij} = 0$ means otherwise. Denoting the power sets of textual labels $W_S$ and $W_T$ as $\mathcal{P}(W_S)$ and $\mathcal{P}(W_T)$, for multi-label classification we need to find the optimal class label set column vector $l_i$ for the $i$-th test instance in the power set space $\mathcal{P}(W_T)$. At training time $X_S, Y_S, L_S, W_S$ are all observed. At test time only new class names $W_T$ and images $X_T$ are given; their representation $Y_T$ and multi-label vectors $L_T$ are to be predicted.
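The size of the label-subset space described above can be made concrete with a minimal sketch. The label names and the helper `label_power_set` are hypothetical, chosen only for illustration; the sketch simply enumerates every binary label vector over a small test label set.

```python
from itertools import product


def label_power_set(labels):
    """Enumerate every binary label vector over `labels` with its label subset."""
    combos = []
    for bits in product((0, 1), repeat=len(labels)):
        # bits[j] == 1 means label j is present in this combination
        subset = [w for w, b in zip(labels, bits) if b == 1]
        combos.append((bits, subset))
    return combos


# Hypothetical test label set with m_T = 3 labels
test_labels = ["desert", "sea", "trees"]
combos = label_power_set(test_labels)
print(len(combos))  # number of label subsets, 2 ** 3
```

Because the count doubles with every extra label, exhaustive enumeration is only practical for small test label sets such as the handful of labels used in the experiments.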

3.2 Learning a semantic word space 

The semantic representations $Y_S$ and $Y_T$ are the projection of each instance into a linguistic word vector space $\mathcal{V}$. The semantic word vector space is learned by using the state-of-the-art skip-gram language model [1221, E3] on all English Wikipedia articles*. The space $\mathcal{V}$ represents almost all available English vocabulary and thus is potentially much more effective than human annotators at measuring subtle similarities and differences between any two textual labels. Furthermore, $\mathcal{V}$ encodes the syntactic and semantic regularities in language [IZ3] which allows vector-oriented reasoning by its ‘compositionality’ property. This property enables the critical capability of synthesising the exhaustive set of test label combinations $\mathcal{P}(W_T)$. Note that cosine distance is used in the space $\mathcal{V}$ because of its robustness against noise [IZ3, Q]. We use $v: W \to \mathcal{V}$ to represent the skip-gram projection from textual concepts (words) in $W$ to vectors in $\mathcal{V}$. Such a semantic space thus captures the correlations between labels without any need to collect visual examples - the meaning of multiple labels for one instance can be inferred by the sum of the word vector projections of its individual labels. Formally, we have

$$Y_S = v(W_S) \cdot L_S, \qquad Y_T = v(W_T) \cdot L_T \qquad (1)$$

where $v(W_S)$ and $v(W_T)$ are the word vector projections of the label class sets in the auxiliary and test datasets respectively. The next section discusses how to learn a predictive model for $Y_T$ given visual data $X_T$.
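As an illustration of the compositionality in Eq (1), the following minimal sketch sums the word vectors of the active labels for a single instance. The 3-dimensional vectors are toy values standing in for the 100-dimensional skip-gram vectors, and `compose` is a hypothetical helper name.

```python
def compose(word_vectors, label_vector):
    """Compute y = v(W) . l by summing the word vectors of the active labels."""
    dim = len(word_vectors[0])
    y = [0.0] * dim
    for vec, active in zip(word_vectors, label_vector):
        if active:
            y = [a + b for a, b in zip(y, vec)]
    return y


# Toy word vectors (assumed values, one row per label)
v_W = [
    [1.0, 0.0, 0.0],  # "sea"
    [0.0, 1.0, 0.0],  # "sunset"
    [0.0, 0.0, 1.0],  # "trees"
]
y = compose(v_W, [1, 1, 0])  # instance labelled with "sea" and "sunset"
print(y)
```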

3.3 Multi-output deep regression 

We design a multi-output deep regression (Mul-DR) model $f: \mathcal{X} \to \mathcal{V}$ to predict the semantic representation $Y_T \in \mathcal{V}$ from images $X_T \in \mathcal{X}$, where $\mathcal{X}$ is the space of raw image pixel intensity values. Our Mul-DR is inspired by the recent success of deep convolutional neural network (CNN) features [US, E9] as well as the importance of modelling correlations within the semantic representation. The Mul-DR model is a neural network composed of nine layers: Layers 1-5 are convolutional layers; Layers 6-8 are fully connected layers; Layer 9 is the linear mapping layer with 100 least square regressors.

Two key components contribute to the effectiveness of Mul-DR. The first component (layers 1-7) provides state-of-the-art feature extraction for many computer vision tasks [123]. It directly maps the raw image to the powerful CNN features^, avoiding the pitfall of bad performance due to “wrong selection” of features for a given dataset. The second component (layers 8-9) provides the multi-output neural network (NN) regressors. Different from [lESI, 123], where the 8-th layer is an output layer for classification, the 8-th layer in our model is a fully connected layer of 1024 neurons with Rectified Linear Unit (ReLU) activation functions. This soft-thresholding non-linearity has better properties for generalisation than the widely used tanh activation units. Such a fully connected layer helps explore correlations among the different dimensions in the semantic word space. The final (9-th) layer of least square regressors provides an estimation of the 100 dimensional semantic representation in the space $\mathcal{V}$.

*Only articles are used without any user talk/discussion. As of 13 Feb. 2014, it includes 2.9 billion words and a vocabulary of 4.33 million (single and bi/tri-gram words).

^However, it has more than 148.3 million parameters and thus, to prevent overfitting on a small auxiliary dataset, ImageNet with 1.2 million labelled instances is used to train this component [E3].

To apply this neural network, we resize all images $X_S$ and $X_T$ to 231 × 231 pixels. The parameters of the first component are pre-trained using ImageNet [E3] while the parameters of the second component are trained by gradient descent with auxiliary data $X_S$ and $Y_S$. At test time, Mul-DR predicts the semantic word vector $\hat{y}_i$ for each unseen image $x_i \in X_T$, $i \in \mathcal{U}$. Here the hat operator indicates that the variable is estimated.
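The second component (layers 8-9) can be sketched as a forward pass through one ReLU fully connected layer followed by a linear output layer trained with a squared-error objective. This is a minimal sketch under stated assumptions: the layer sizes are shrunk to toy values (the text uses 1024 hidden units and 100 outputs), the weights are made up, and the function names are hypothetical.

```python
def relu(x):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def regression_head(features, W8, b8, W9, b9):
    """Layer 8 (fully connected + ReLU) followed by layer 9 (linear regressors)."""
    hidden = relu(dense(features, W8, b8))
    return dense(hidden, W9, b9)


def squared_error(pred, target):
    """Least-squares loss between a predicted and a target word vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


# Toy stand-ins: 2 CNN features, 2 hidden units, 1 output dimension (assumed values)
features = [1.0, -2.0]
W8, b8 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W9, b9 = [[2.0, 3.0]], [0.5]
y_hat = regression_head(features, W8, b8, W9, b9)
loss = squared_error(y_hat, [2.0])
```

Sharing the hidden layer across all output dimensions is what lets the regressors model correlations between dimensions, in contrast to fitting each dimension independently.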


3.4 Zero-shot multi-label prediction 


Given the estimated semantic representation $\hat{Y}_T$, we need to infer the labels $L_T$ of the test set. A straightforward solution is to decompose the multi-label classification problem into multiple independent binary classification problems, which is equivalent [113] to directly solving Eq (1) by:

$$\hat{L}_T = \left[ v(W_T)^{\top} v(W_T) \right]^{\dagger} \left[ v(W_T) \right]^{\top} \cdot \hat{Y}_T \qquad (2)$$

where $\dagger$ is the Moore-Penrose pseudo-inverse. Eq (2) directly predicts the labels of each instance by a linear transformation of the intermediate representation $\hat{Y}_T$. In a way, this can be considered as an extension of the ‘Direct Attribute Prediction (DAP)’ [113] to the case of multi-label and continuous representation. We thus term this method exDAP. However, this does not exploit the multi-label correlations and thus has very limited expressive power [Q, El]. Hence we propose two more principled multi-label zero-shot algorithms - Direct Multi-label zero-shot Prediction (DMP) and Transductive Multi-label zero-shot Prediction (TraMP).
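The exDAP baseline of Eq (2) can be sketched for a single test instance as a small least-squares solve. Two assumptions are made here that the text does not state: the Gram matrix of the label word vectors is taken to be invertible, so the pseudo-inverse reduces to an ordinary inverse, and the continuous scores are binarised at a threshold of 0.5. The helper names `solve` and `exdap` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]  # assumed non-zero (A invertible)
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def exdap(word_vectors, y_hat, threshold=0.5):
    """Least-squares label scores for one instance, then an assumed 0.5 threshold."""
    m = len(word_vectors)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    gram = [[dot(word_vectors[i], word_vectors[j]) for j in range(m)] for i in range(m)]
    rhs = [dot(word_vectors[i], y_hat) for i in range(m)]
    scores = solve(gram, rhs)
    return scores, [1 if s > threshold else 0 for s in scores]


# Toy label word vectors (one row per test label) and a toy projected image
v_WT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores, labels = exdap(v_WT, [0.9, 0.1, 0.3])
```

Each label score depends only on how well its own word vector explains the projection, which is why this baseline cannot express preferences over label combinations.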

Direct Multi-label zero-shot Prediction (DMP) Thanks to the compositionality property of $\mathcal{V}$, label correlation can be explored by synthesising the representation of every possible multi-label annotation in $\mathcal{V}$; that is, the power set of label vector matrix $P = v(\mathcal{P}(W_T))$ where $P = [p_1, \cdots, p_{2^{m_T}}]$. Thus Eq (2) is replaced by a nearest neighbour (NN) classifier using all the synthesised instances as training data. The label set $\hat{l}_i$ of instance $i \in \mathcal{U}$ with representation $\hat{y}_i = f(x_i)$ is then assigned as $p_a \in v(\mathcal{P}(W_T))$, where $a$ is the index computed by

$$a = \arg\min_{j} \| \hat{y}_i - p_j \| \qquad (3)$$

where $\| \cdot \|$ refers to the cosine distance.
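A minimal sketch of DMP follows: synthesise one prototype per label combination by summing word vectors, then return the combination whose prototype is nearest in cosine distance. The word vectors are toy values, the function names are hypothetical, and the handling of the empty label set (a zero vector, for which cosine distance is undefined) is an assumption: it is given the distance 1.0.

```python
from itertools import product
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 for a zero vector (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def build_prototypes(word_vectors):
    """One (label_bits, prototype_vector) pair per element of the label power set."""
    m, dim = len(word_vectors), len(word_vectors[0])
    prototypes = []
    for bits in product((0, 1), repeat=m):
        vec = [sum(b * wv[d] for b, wv in zip(bits, word_vectors)) for d in range(dim)]
        prototypes.append((bits, vec))
    return prototypes


def dmp_predict(y_hat, prototypes):
    """Return the label bits of the prototype nearest to y_hat."""
    return min(prototypes, key=lambda p: cosine_distance(y_hat, p[1]))[0]


# Toy word vectors for three test labels and a toy projected image
v_WT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prototypes = build_prototypes(v_WT)
predicted = dmp_predict([0.8, 0.7, 0.05], prototypes)
```

In this toy case the projection lies between the first two label directions, so the two-label prototype is closer than either single-label prototype.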

Transductive Multi-label zero-shot Prediction (TraMP) DMP can explore label correlations but only insofar as encoded by the compositionality of the prototypes in $\mathcal{V}$. It would be more desirable if the manifold structure of $\hat{Y}_T$ given test instances $X_T$ could be used to improve multi-label zero-shot learning, i.e. via transductive learning. We therefore propose TraMP, which can be viewed as an extension of the TRAM model in [D] for zero-shot learning, or a semi-supervised generalisation of Eq (3). The key idea is to use the power set of prototypes $P$ as a known label set and to perform transductive label propagation from $P$ to the inferred semantic representations $\hat{Y}_T$. We denote the index set of the power set prototypes as $\mathcal{C} = \{n_S + n_T + 1, \cdots, n_S + n_T + 2^{m_T}\}$ and its corresponding class label set as $L_P$. Specifically, we define a k-nearest neighbour (kNN) graph among the test instances $\hat{Y}_T$ and prototypes $P$. For any two instances $i$ and $z$, where $i, z \in \{\mathcal{U}, \mathcal{C}\}$,

$$\omega_{iz} = \begin{cases} \dfrac{1}{Z_i} \exp\left( -\dfrac{\| \hat{y}_i - \hat{y}_z \|^2}{\sigma} \right) & \text{if } z \in NN_k(\hat{y}_i, [\hat{Y}_T, P]) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\sigma \approx \mathrm{median}\, \| \hat{y}_i - \hat{y}_z \|^2$. $NN_k(\hat{y}_i, [\hat{Y}_T, P])$ indicates the index set of the k-nearest neighbours of $\hat{y}_i$ from $[\hat{Y}_T, P]$. $Z_i = \sum_{z \in NN_k(\hat{y}_i, [\hat{Y}_T, P])} \exp\left( -\| \hat{y}_i - \hat{y}_z \|^2 / \sigma \right)$ is the normalisation term to make sure $\sum_z \omega_{iz} = 1$. We define $A = I - \omega$ and partition the matrix $A$ into blocks,

$$A = \begin{bmatrix} A_{\mathcal{CC}} & A_{\mathcal{CU}} \\ A_{\mathcal{UC}} & A_{\mathcal{UU}} \end{bmatrix}$$

and the label set of test instances can be inferred by the following closed form solution [O],

$$\hat{L}_T = -A_{\mathcal{UU}}^{-1} A_{\mathcal{UC}} L_P. \qquad (5)$$
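The label propagation step can be sketched end to end on a toy problem. Several choices below are assumptions rather than details confirmed by the text: the graph weights use a Gaussian kernel on squared Euclidean distance with bandwidth `sigma`, each node's weights are normalised over its k nearest neighbours, the propagated scores are binarised at 0.5, and every test node is assumed to reach a prototype through the graph so that the linear system is solvable. All function names are hypothetical.

```python
from math import exp


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_weights(points, k, sigma):
    """Row-normalised Gaussian weights over each node's k nearest neighbours."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((z for z in range(n) if z != i),
                            key=lambda z: sq_dist(points[i], points[z]))[:k]
        raw = {z: exp(-sq_dist(points[i], points[z]) / sigma) for z in neighbours}
        Z_i = sum(raw.values())
        for z, value in raw.items():
            W[i][z] = value / Z_i
    return W


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]  # assumed non-zero
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def tramp_scores(prototypes, proto_labels, test_points, k, sigma):
    """Solve A_UU L = -A_UC L_P, with A = I - W, one label column at a time."""
    points = prototypes + test_points  # prototypes first, then test nodes
    c, n = len(prototypes), len(points)
    W = knn_weights(points, k, sigma)
    U = list(range(c, n))
    A_uu = [[(1.0 if i == z else 0.0) - W[i][z] for z in U] for i in U]
    scores = [[0.0] * len(proto_labels[0]) for _ in U]
    for j in range(len(proto_labels[0])):
        # -A_UC L_P equals W_UC L_P because A_UC = -W_UC
        rhs = [sum(W[i][z] * proto_labels[z][j] for z in range(c)) for i in U]
        for r, value in enumerate(solve(A_uu, rhs)):
            scores[r][j] = value
    return scores


# Toy 2-label problem: prototypes are the four label combinations themselves
prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proto_labels = [[0, 0], [1, 0], [0, 1], [1, 1]]
test_points = [[0.9, 0.1], [0.95, 0.9]]
scores = tramp_scores(prototypes, proto_labels, test_points, k=2, sigma=1.0)
predictions = [[1 if s > 0.5 else 0 for s in row] for row in scores]
```

Unlike the nearest-prototype rule, each test node's scores here also depend on neighbouring test nodes, which is how the manifold structure of the test data enters the prediction.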


3.5 Generalisation of multi-output deep regression 

As described above, our framework consists of two key steps: applying the multi-output deep regression (Mul-DR) model to obtain the estimated semantic representation $\hat{Y}_T$, followed by applying either DMP or TraMP to predict $\hat{L}_T$. There is however an unsolved issue, that is, our Mul-DR is learned from the auxiliary data with a different set of labels from the target/test data. This projection model is thus not guaranteed to accurately project a test image to be near its ground truth label vector in the semantic word space. For example, if our Mul-DR is learned to project images of cat and dog to the word vector representations of “cat” and “dog” ($v(\text{“cat”})$ and $v(\text{“dog”})$), it may not accurately project an image with a person and a chair to its word vector representation $v(\text{“person”}) + v(\text{“chair”})$ when neither label was available for learning the Mul-DR model. Any regression model will have such a generalisation problem, especially when the test data are distributed differently from the auxiliary data. To make the Mul-DR model generalise better to the target domain, we transductively exploit the predicted semantic representation $\hat{Y}_T$ to update the power set of label vector matrix $P$. In this way the target data would be better aligned with the synthesised label combination vectors in the semantic word space, thus helping generalise the Mul-DR to the target domain. This can be viewed as a semi-supervised learning (SSL) method starting from one instance for each label combination if the synthesised prototypes themselves are treated as instances. We therefore take a simple SSL strategy and perform one step of self-training [0] to refine each prototype of $P$,

$$\hat{p}_i = \frac{1}{k} \sum_{\hat{y} \in NN_k(p_i, \hat{Y}_T)} \hat{y} \qquad (6)$$

where $\hat{P} = [\hat{p}_1, \cdots, \hat{p}_{2^{m_T}}]$ is the updated prototype matrix and $k$ is the number of nearest neighbours^ selected. We use the updated label vector matrix $\hat{P}$ to compute DMP (Eq (3)) and TraMP (Eqs (4) and (5)) in our framework.

^Note that this $k$ is not necessarily the same as the $k$ value in Eq (4).
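The one-step self-training update can be sketched as replacing each prototype by the mean of its k nearest projected test instances. This sketch assumes cosine distance for the neighbour search (the distance used elsewhere in the word space), assumes non-zero prototype vectors, and uses toy 2-dimensional values; `self_train` is a hypothetical name.

```python
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def self_train(prototypes, projections, k):
    """Replace each prototype by the mean of its k nearest test projections."""
    updated = []
    for p in prototypes:
        nearest = sorted(projections, key=lambda y: cosine_distance(p, y))[:k]
        # len(nearest) equals k whenever at least k projections are available
        updated.append([sum(y[d] for y in nearest) / len(nearest)
                        for d in range(len(p))])
    return updated


# Toy prototypes and toy Mul-DR outputs for four test images (assumed values)
prototypes = [[1.0, 0.0], [0.0, 1.0]]
projections = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
updated = self_train(prototypes, projections, k=2)
```

The updated prototypes move toward where the regression model actually places test images, which is the alignment effect the self-training step is meant to provide.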
4 Experiments

Datasets Two popular multi-label datasets - Natural Scene [E3] and IAPRTC-12 [O] - are used to evaluate our framework. Natural Scene consists of 2000 natural scene images where each image can be labelled with any combination of desert, mountains, sea, sunset and trees, and over 22% of the whole dataset is multi-labelled. For multi-label zero-shot learning on Natural Scene, we use a multi-class single label dataset - the Scene dataset [IZl] (2688 images in total) - as the auxiliary dataset, which has been labelled with a non-overlapping set of labels such as street, coast and highway. IAPRTC-12 consists of 20000 images and a total of 275 different labels. The labels are hierarchically organised into 6 main branches: humans, animals, food, landscape-nature, man-made and other. Our experiments consider the subset of the landscape-nature branch (around 9500 images) and use the top 8 most frequent labels from this branch, with over 30% of the test images being multi-label. For zero-shot classification on this dataset, we employ both Scene and Natural Scene as the auxiliary dataset.

4.1 Experimental setup 

Evaluation metrics (a) Hamming Loss: it measures the percentage of mismatches between estimated and ground-truth labels; (b) MicroF1 [EB]: it evaluates both micro average of Precision (Micro-Precision) and micro average of Recall (Micro-Recall) with equal importance; (c) Ranking Loss: given the ranked list of predicted labels, it measures the number of label pairs that are incorrectly ordered by comparing their confidence scores with the ground-truth labels; (d) Average precision (AP): given a ranked list of classes, it measures the area under the precision-recall curve. These four criteria evaluate very different aspects of multi-label classification performance. Usually very few algorithms can achieve the best performance on all metrics. High values are preferred for MicroF1 and AP and vice-versa for Ranking and Hamming loss. For ease of interpretation we present 1 - MicroF1 and 1 - AP, so smaller values are preferred for all metrics.
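Two of these metrics can be sketched directly from their definitions. This is a minimal illustration on made-up binary label matrices (one row per image, one column per label), with micro-F1 computed as the harmonic mean of micro-precision and micro-recall, i.e. 2TP / (2TP + FP + FN); the function names are hypothetical.

```python
def hamming_loss(true, pred):
    """Fraction of (image, label) entries where prediction and ground truth differ."""
    total = sum(len(row) for row in true)
    wrong = sum(t != p for tr, pr in zip(true, pred) for t, p in zip(tr, pr))
    return wrong / total


def micro_f1(true, pred):
    """Micro-averaged F1 over all (image, label) entries."""
    tp = fp = fn = 0
    for tr, pr in zip(true, pred):
        for t, p in zip(tr, pr):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Made-up ground truth and predictions for two images and three labels
true = [[1, 0, 1], [0, 1, 0]]
pred = [[1, 0, 0], [0, 1, 1]]
hl = hamming_loss(true, pred)
f1 = micro_f1(true, pred)
```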

Competitors Our full framework includes two main novel components: Mul-DR and DMP/TraMP. To evaluate the effectiveness of these two components, we define several competitors by replacing each component with possible alternatives. (1) SVR+exDAP: Support Vector Regression (SVR)^ [□] is used to learn $f: \mathcal{X} \to \mathcal{V}$ and infer the representation of each test instance. Using exDAP (Eq (2)) is a straightforward generalisation of [IH, ED] to multi-label zero-shot learning. (2) SVR+DMP: SVR replaces Mul-DR and we further use DMP (Eq (3)) for classification; thus it serves as a reference to compare DMP with exDAP. (3) DeViSE+DMP: We use DeViSE [0] to learn the visual-semantic embedding into which the power set $P$ is projected, and we use Eq (3) for final labelling in the embedding space, i.e., DMP. Thus it corresponds to the extension of [□] to multi-label zero-shot learning problems. (4) Mul-DR+exDAP: Our Mul-DR is used to learn the visual-semantic embedding, with exDAP for multi-label classification; thus it can be used to compare Mul-DR with SVR. (5) Mul-DR+DMP/TraMP: Our method with either of the two proposed ZSL algorithms used. For fair comparison, all results use the self-training strategy in Eq (6) to update the prototypes.

4.2 Results 

^For fair comparison, we use the CNN features output by the first component (Layers 1-7) of our Mul-DR framework as the low-level feature for the linear SVR, with the cost parameter set to 10.

Figure 1: Comparing different zero-shot multi-label classification methods on Natural Scene and IAPRTC-12.

Our Mul-DR model vs. alternatives The results obtained by various competitors on Natural Scene and IAPRTC-12 are shown in Fig. 1. We first compare our Mul-DR with the alternative SVR and DeViSE models for learning the projection from raw images to the semantic word space. It is evident that our Mul-DR significantly improves on the results of the conventional SVR [ns, ED] regression model (Mul-DR+DMP > SVR+DMP, Mul-DR+exDAP > SVR+exDAP). This is because SVR treats each of the 100 semantic word space dimensions independently, whilst our multi-output regression model, as well as the DeViSE model [□], captures the correlations between different dimensions. Compared to the DeViSE model [0] (Mul-DR+DMP vs. DeViSE+DMP), our regression model is also clearly better on three of the four evaluation metrics, suggesting that direct and explicit mapping between the image space and the semantic word space is a better strategy. The only case where a better result is obtained by DeViSE+DMP is on the IAPRTC-12 dataset with Hamming Loss. But this result is worth further discussion. In particular, we note that Hamming Loss treats the false alarm and missing prediction errors equally. However, for multi-label classification problems, the distribution of labels is very unbalanced and each image usually has only a small portion of labels compared to the whole label set. This is particularly the case for IAPRTC-12. The good result of DeViSE on IAPRTC-12 with better Hamming Loss but worse MicroF1 and Ranking Loss is an indication that it is mostly predicting no label, and is biased against making any predictions. This explains the qualitative results of DeViSE shown in Table 1.
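The point about Hamming Loss can be illustrated with a small worked example. The numbers are hypothetical and not taken from the experiments: consider one image with 8 candidate labels of which 2 are correct, and compare a method that predicts no label with one that predicts both correct labels plus 2 wrong ones, using MicroF1 = 2TP / (2TP + FP + FN).

```latex
\begin{align*}
\text{Predict no label:}\quad
  & \text{Hamming} = \tfrac{0 + 2}{8} = 0.25,
  & \text{MicroF1} &= \tfrac{2 \cdot 0}{2 \cdot 0 + 0 + 2} = 0 \\
\text{Predict 2 correct + 2 wrong:}\quad
  & \text{Hamming} = \tfrac{2 + 0}{8} = 0.25,
  & \text{MicroF1} &= \tfrac{2 \cdot 2}{2 \cdot 2 + 2 + 0} \approx 0.67
\end{align*}
```

Both predictions receive the same Hamming Loss, yet only the second recovers any correct label, which is why a low Hamming Loss combined with a poor MicroF1 points to a method that rarely predicts anything.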

Our DMP/TraMP vs. exDAP Given the same regression model, we compared our DMP against the alternative exDAP. The results (SVR+DMP > SVR+exDAP, Mul-DR+DMP > Mul-DR+exDAP) show that our algorithm, which is based on synthesising the label combinations in order to encode the multi-label correlations, is superior to exDAP, which treats each label independently and decomposes the multi-label classification problem into multiple single label classification problems. Comparing the two proposed algorithms - DMP and TraMP, the main difference is that TraMP transductively exploits the manifold structure of the test data for label prediction. Figure 1 shows that this transductive label prediction algorithm is better overall. Specifically, TraMP has much better MicroF1, Ranking Loss and AP than DMP. The NN classifier (Eq (3)) used in DMP is directly minimising the Hamming Loss. This explains why TraMP is slightly worse than DMP on IAPRTC-12 on Hamming Loss.

Effectiveness of the self-training step In this experiment we compare the results of our DMP and TraMP with and without the self-training step in Eq (6). We use ‘-’ and ‘+’ to indicate algorithms without and with self-training respectively. Both DMP and TraMP use Mul-DR to infer the word vector $\hat{Y}_T$. As shown in Fig. 2, the self-training step clearly has a positive influence on the multi-label prediction performance. This result suggests that this simple step is helpful in making the Mul-DR model learned from the auxiliary data generalise better to the target data.

Figure 2: Effectiveness of self-training on DMP and TraMP.

Table 1: Examples of multi-label zero-shot predictions on the IAPRTC-12 dataset. The top 8 most frequent labels of the landscape-nature branch are considered.

Method       | Image 1                   | Image 2                         | Image 3                         | Image 4
Groundtruth  | sand-beach, mountain, sky | landscape-nature, mountain, sky | grass                           | sand-beach, sky
Mul-DR+DMP   | sand-beach, sky           | landscape-nature, mountain, sky | grass                           | sand-beach, sky
Mul-DR+TraMP | sand-beach, mountain, sky | landscape-nature, mountain, sky | grass, ground, landscape-nature | ground, sky, sand-beach

DeViSE+DMP: sky (for two of the four images)

Qualitative results Table 1 gives a qualitative comparison of multi-label annotation by our DMP and TraMP with DeViSE on IAPRTC-12. As discussed, DeViSE is too conservative on this dataset and assigns no label to most instances.

5 Conclusion and future work 

We have for the first time generalised zero-shot learning from the single label to the multi-label setting. It is somewhat surprising that it turns out to be possible to exploit label correlation at test time in the zero-shot case, since there is no dataset of examples from which to learn co-occurrence statistics in the conventional way. We achieve this by introducing novel strategies to exploit the compositionality of the semantic word space, and by transductively exploiting the unlabelled test data.

Besides the proposed tailor-made multi-label algorithms - DMP and TraMP, our strategy could potentially help other existing multi-label algorithms to generalise to the multi-label zero-shot learning problem. Finally, we note that many prototypes of the power set $P$ actually have an extremely low chance of occurring in the test dataset. They should not be considered in the same way as the other more likely prototypes. Thus another line of ongoing research is to investigate how to prune low-probability prototypes from the power set $P$.